import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report,roc_auc_score
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.decomposition import PCA
from sklearn.svm import SVC
vehicledf=pd.read_csv("../Unsupervised_project_py/vehicle-1.csv")
vehicledf.head(20)
vehicledf.shape
vehicledf.dtypes
# class is the categorical variable that needs to be label encoded
# The other 18 variables are numeric independent variables
# Missing data values are marked as NaN
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
lbl = LabelEncoder()
columns = vehicledf.columns
#Let's Label Encode our class variable:
print(columns)
vehicledf['class'] = lbl.fit_transform(vehicledf['class'])
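As a quick illustration of what `LabelEncoder` does here, the sketch below encodes a few hypothetical class labels (matching the van/car/bus classes used later in this notebook). `LabelEncoder` assigns integers in alphabetical order of the class names:

```python
# Minimal illustration of LabelEncoder's alphabetical integer mapping.
# The labels below are stand-ins for the vehicle dataset's class column.
from sklearn.preprocessing import LabelEncoder

lbl_demo = LabelEncoder()
codes = lbl_demo.fit_transform(["van", "car", "bus", "car"])
# classes_ holds the sorted unique labels; transform maps each to its index
print(dict(zip(lbl_demo.classes_, lbl_demo.transform(lbl_demo.classes_))))
# {'bus': 0, 'car': 1, 'van': 2}
print(codes)  # [2 1 0 1]
```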
#count the null values NaN values in the columns
vehicledf.isna().sum()
Insights: several attributes contain missing (NaN) values, so we impute them next.
from sklearn.impute import SimpleImputer
newdf = vehicledf.copy()
X = newdf.iloc[:,0:19] #take all 19 columns (the 18 numeric attributes plus the encoded class) for imputation
#y = vehdf.iloc[:,18] #separating the class attribute
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
#fill missing values with the median of each column
transformed_values = imputer.fit_transform(X)
column = X.columns
newdf = pd.DataFrame(transformed_values, columns = column )
newdf.head(20)
newdf.describe()
newdf.isna().sum()
#As we can see, there are no null values now; we have already replaced them with the median values
newdf.describe().T
#compactness: mean and median are similar, so it is roughly normally distributed with no obvious skew/outliers
#circularity: mean and median are also close, so it appears normally distributed
#scatter_ratio: mean is higher than median, so it is positively skewed
#scaled_variance and scaled_variance.1: mean is higher than median, so they are positively/right skewed
#hollows_ratio: mean is slightly less than median, so it is slightly negatively/left skewed and may have outliers
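The mean-vs-median reasoning above can be captured in a tiny helper. This is an illustrative sketch (the function name and tolerance are mine, not from the notebook), using toy series with obvious tails:

```python
import pandas as pd

def skew_direction(s: pd.Series, tol: float = 1e-6) -> str:
    """Rough skew heuristic: mean > median suggests a right tail, mean < median a left tail."""
    diff = s.mean() - s.median()
    if diff > tol:
        return "right-skewed"
    if diff < -tol:
        return "left-skewed"
    return "approximately symmetric"

right = pd.Series([1, 1, 2, 2, 100])   # long right tail pulls the mean up
left = pd.Series([-100, 8, 9, 9, 10])  # long left tail pulls the mean down
print(skew_direction(right))  # right-skewed
print(skew_direction(left))   # left-skewed
```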
newdf.hist(bins=20,figsize=(60,40),color='lightblue')
plt.show()
#Most of the attributes are approximately normally distributed
#circularity seems to have multiple peaks but is almost normally distributed
#hollows_ratio is left skewed, as stated earlier
#scaled_variance, scaled_variance.1, skewness_about.1, skewness_about.2 and scatter_ratio are right skewed
#pr.axis_rectangularity seems to have outliers, as there are gaps in its histogram
newdf.skew()
#Some attributes are right skewed, while only hollows_ratio is left skewed
newdf.plot(kind= 'box' , subplots=True,layout=(5,4), sharex=False, sharey=False, figsize=(20,15))
plt.show()
#Insights:
#radius_ratio, pr.axis_aspect_ratio, skewness_about, max_length_aspect_ratio, scaled_radius_of_gyration.1 and scaled_variance.1 have outliers
#skewness_about.1 and scaled_variance have only one outlier each
#To view the distribution of categorical variable, draw a countplot
sns.countplot(x='class', data=newdf)
#The largest number of vehicles is identified as car
newdf1=newdf.drop('class',axis=1) #dropping the class attribute
corr = newdf1.corr() #correlations among the independent attributes only
plt.figure(figsize=(20,15))
sns.heatmap(corr, annot=True,linewidths=2)
Insights: Strong correlations:
- scaled_variance.1 is highly correlated (0.99) with pr.axis_rectangularity and scatter_ratio
- pr.axis_rectangularity & scatter_ratio are strongly correlated (0.99)
- scaled_variance & scaled_variance.1 are strongly correlated (0.95)
- skewness_about.2 and hollows_ratio are strongly correlated (corr coeff: 0.89)
- distance_circularity and radius_ratio have a high positive correlation (corr coeff: 0.81)
- compactness & circularity, and radius_ratio & pr.axis_aspect_ratio, are moderately correlated (coeff: 0.67)
- scaled_variance & scaled_radius_of_gyration, and circularity & distance_circularity, are highly correlated (corr coeff: 0.79)
- pr.axis_rectangularity and max.length_rectangularity are also strongly correlated (coeff: 0.81)
- scatter_ratio and elongatedness have a strong negative correlation (-0.97)
- elongatedness and pr.axis_rectangularity have a strong negative correlation (-0.95)
Weak or no correlation:
- skewness_about and scatter_ratio are not correlated
- scaled_variance and max_length_aspect_ratio have very weak correlation (0.32)
- max_length_aspect_ratio & radius_ratio have moderate correlation (coeff: 0.45)
- pr.axis_aspect_ratio & max_length_aspect_ratio have very weak correlation (0.16)
- scaled_radius_of_gyration & scaled_radius_of_gyration.1 are only weakly correlated
- scaled_radius_of_gyration.1 & skewness_about are only weakly correlated
- skewness_about & skewness_about.1 are not correlated
- skewness_about.1 and skewness_about.2 are not correlated
sns.pairplot(newdf1, diag_kind="kde")
#Insights:
#At most three clusters are visible here
#Most attribute pairs show a linear relationship
#scaled_variance & scaled_variance.1 have a strong linear relationship
#pr.axis_rectangularity and scaled_variance / scaled_variance.1 have strong linear relationships
#skewness_about.2 and hollows_ratio also show a strong positive correlation / linear relationship
#scatter_ratio and elongatedness have an inverse linear relationship
#elongatedness and pr.axis_rectangularity have a strong inverse linear relationship
From the correlation matrix above we can see that many features are highly correlated. There is no point in keeping both features of a highly correlated pair, so we can drop one of them to avoid multicollinearity. We will therefore find all features whose correlation is +/-0.9 or above and consider dropping them. The following columns qualify:
max.length_rectangularity, scaled_radius_of_gyration, skewness_about.2, scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance, scaled_variance.1
Conclusion: more than 50% of the attributes are highly correlated, so we need to deal with this multicollinearity problem.
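The "find all features with |correlation| >= 0.9" step described above can be sketched as a small helper. This is an illustrative implementation of the idea (the function name and the toy frame are mine), scanning the upper triangle of the correlation matrix so each pair is checked once:

```python
import numpy as np
import pandas as pd

def high_corr_columns(df: pd.DataFrame, threshold: float = 0.9):
    """Return the columns involved in any pair with |correlation| >= threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    flagged = set()
    for col in upper.columns:
        partners = upper.index[upper[col] >= threshold]
        if len(partners):
            flagged.add(col)
            flagged.update(partners)
    return sorted(flagged)

# toy frame: a and b are perfectly correlated, c is unrelated noise
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 1, 4, 2]})
print(high_corr_columns(toy, 0.9))  # ['a', 'b']
```

On the vehicle data this would be called as `high_corr_columns(newdf1, 0.9)` to recover the column list above.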
#now separate the dataframe into dependent and independent variables
X = newdf.drop('class',axis=1)
y = newdf['class']
from sklearn.preprocessing import StandardScaler
#We standardize the entire X (independent variable data) with StandardScaler so each feature has zero mean and unit variance.
#The PCA dimensions will be created on this standardized distribution.
sc = StandardScaler()
X_std = sc.fit_transform(X)
X_train,X_test,y_train,y_test = train_test_split(X_std,y,test_size=0.30,random_state=1)
#Covariance matrix
cov_matrix = np.cov(X_std.T)
print("cov_matrix shape:",cov_matrix.shape)
print("Covariance_matrix",cov_matrix)
#Calculate Eigen Vectors & Eigen Values
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
#Sort eigenvalues in descending order
eig_pairs = [(eigenvalues[index], eigenvectors[:, index]) for index in range(len(eigenvalues))]
# Sort the (eigenvalue, eigenvector) pairs from highest to lowest eigenvalue
# (sort on the eigenvalue only; comparing the vectors themselves would fail on ties)
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eigenvalues))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eigenvalues))]
# print out sorted eigenvalues
print('Eigen values in descending order: \n%s' %eigvalues_sorted)
tot = sum(eigenvalues) #total sum of eigen values
var_explained = [(i / tot) for i in sorted(eigenvalues, reverse=True)] # variance explained by each
# eigenvector... there will be 18 entries, as there are 18 eigenvectors
cum_var_exp = np.cumsum(var_explained) # cumulative variance; 18 entries, with the 18th entry equal to 1.0
plt.bar(range(1,19), var_explained, alpha=0.5, align='center', label='individual explained variance')
plt.step(range(1,19),cum_var_exp, where= 'mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc = 'best')
plt.show()
#From the plot above we can clearly observe that 8 components explain almost 95% of the variance in the data,
#so we will use the first 8 principal components going forward and compute the reduced dimensions.
#Here we analyse the data in the reduced mathematical space with 8 dimensions.
#Reduce the 18 columns to 8 dimensions
d_reduce=np.array(eigvectors_sorted[0:8])
#projecting original data into principal component dimensions
X_std_8D=np.dot(X_std,d_reduce.T)
reduced_pca=pd.DataFrame(X_std_8D) #converting array to dataframe
reduced_pca
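As a cross-check on the manual eigendecomposition above, sklearn's `PCA` can pick the number of components for a target variance directly: passing a float in (0, 1) as `n_components` keeps the fewest components whose cumulative explained-variance ratio reaches that fraction. The sketch below uses a small random stand-in matrix rather than `X_std`, so the component count it prints is illustrative only:

```python
# Cross-check of the manual eigendecomposition using sklearn's PCA.
# X_demo is a random stand-in for a standardized matrix like X_std.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(100, 18)))

# A float n_components keeps the fewest components reaching that variance fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_demo)
print(pca.n_components_)                    # number of components chosen automatically
print(pca.explained_variance_ratio_.sum())  # >= 0.95 by construction
```

On the vehicle data, `PCA(n_components=0.95).fit(X_std)` should select 8 components, matching the manual analysis.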
sns.pairplot(reduced_pca, diag_kind='kde')
#As the pair plot above shows, after dimensionality reduction with PCA our attributes have become independent, with no correlation:
#most attribute pairs now form a cloud of points instead of a linear relationship
#PCA data with 8 dimensions
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(reduced_pca,y,test_size=0.30,random_state=1)
svc=SVC()
svc.fit(X_train,y_train) #fit svc model on original training data
y_predict=svc.predict(X_test) #predict y values for test data
svc1=SVC()
svc1.fit(pca_X_train,pca_y_train) #fit svc model on PCA training data
pca_y_predict=svc1.predict(pca_X_test)
#display accuracy scores for both models
print("Model score on original data:", svc.score(X_test,y_test))
print("Model score on reduced PCA dimensions:", svc1.score(pca_X_test,pca_y_test))
print("Accuracy before PCA, on the original 18 dimensions:", accuracy_score(y_test,y_predict))
print("Accuracy after PCA, on the reduced 8 dimensions:", accuracy_score(pca_y_test,pca_y_predict))
#Insights:
#On the test data, the support vector classifier without PCA achieves an accuracy of 95%
#When we applied the SVC to the PCA components, accuracy dropped slightly to 93%
#Despite going from 18 dimensions down to 8, our model still scores well in terms of accuracy
# Calculate the confusion matrix and plot it
def draw_confmatrix(y_test, y_predict, str1, str2, str3, datatype):
    # Evaluate the predictions
    cm = confusion_matrix(y_test, y_predict, labels=[0, 1, 2])
    print("Confusion Matrix For:", "\n", datatype, cm)
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=[str1, str2, str3], yticklabels=[str1, str2, str3])
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
#Confusion matrix for original data
draw_confmatrix(y_test,y_predict,"Van","Car","Bus","Original Data Set")
#Our model correctly classified 58 of 59 actual vans, with only one error (a van predicted as bus)
#Of 133 actual cars, our model correctly predicted 129, with only 4 predicted wrong
#Of 62 actual buses, the model predicted 56 correctly, while 1 bus was misclassified as car and 6 as van
draw_confmatrix(pca_y_test, pca_y_predict,"Van ", "Car ", "Bus", "For Reduced Dimensions Using PCA ")
#With PCA/reduced dimensions, our model correctly classified 57 of 59 actual vans, with two misclassified: one as car and one as bus
#Of 133 actual cars, our model correctly predicted 126; it misclassified 5 cars as bus and 2 as van
#Of 62 actual buses, the model predicted 56 correctly, while it misclassified 2 as car and 5 as van
#original data
print("Classification Report For Raw Data:", "\n", classification_report(y_test,y_predict))
#Our model has its best precision (99%) and a recall of 97% for classifying cars from the given silhouette parameters
#It has the best recall (98%) but the lowest precision (89%) for vans, while buses get the lowest recall (89%) with 93% precision
#The weighted average is 95% across the classification metrics
#Accuracy is 95%
#Model built on Principal Components:
print("Classification Report For PCA:","\n", classification_report(pca_y_test,pca_y_predict))
#Classification metrics on the reduced dimensions after PCA:
#Our model has its best precision (98%) and a recall of 95% for classifying cars from the given silhouette parameters
#It has the best recall (97%) but the lowest precision (89%) for vans, while buses get the lowest recall (89%) with 90% precision
#The weighted average is 94% across the classification metrics
#Accuracy is 94%
def hypertune_classifier(name, model, param_grid, x_train, y_train, x_test, y_test, CV):
    CV_rf = GridSearchCV(estimator=model, param_grid=param_grid, cv=CV, verbose=1, n_jobs=-1)
    CV_rf.fit(x_train, y_train)
    y_pred_train = CV_rf.predict(x_train)
    y_pred_test = CV_rf.predict(x_test)
    print('Best Score: ', CV_rf.best_score_)
    print('Best Params: ', CV_rf.best_params_)
    # Classification report
    print(name + " Classification Report: ")
    print(classification_report(y_test, y_pred_test))
    # Confusion matrix for the test data
    draw_confmatrix(y_test, y_pred_test, "Van", "Car", "Bus", "Original Data Set")
    print("SVM Accuracy Score:", round(accuracy_score(y_test, y_pred_test) * 100, 2))
#Training on SVM Classifier
from sklearn.model_selection import GridSearchCV
svmc = SVC()
#Let's see which parameters can be tuned
print("SVM Parameters:", svmc.get_params())
# Create the parameter grid based on the results of random search
param_grid = [
{'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear']},
{'C': [0.01, 0.05, 0.5, 1], 'kernel': ['rbf']},
]
param_grid_1 = [
{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
#With first set of parameters :Iteration 1
hypertune_classifier ("Support Vector Classifier",svmc,param_grid,pca_X_train,pca_y_train,pca_X_test,pca_y_test,10)
#With second set of parameters :Iteration 2
hypertune_classifier ("Support Vector Classifier",svmc,param_grid_1,pca_X_train,pca_y_train,pca_X_test,pca_y_test,10)
Insights:
- Here we tune the model's important hyperparameters (which are not learned model parameters) to improve performance; we vary the C value and the kernel type, where:
  C: inverse of regularization strength - smaller values of C specify stronger regularization.
  Kernel type: rbf or linear
- GridSearchCV evaluates every combination of the specified hyperparameters and reports the best model and score for us.
- After the grid-search hyperparameter tuning of the SVM model we got the best score from Iteration 2:
  Best Score: 0.9341525423728815
  Best Params: {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
  Accuracy score: 95%, up from 94%, with the best params (C: 1000, kernel: rbf)
- Precision for classifying buses jumped from 90% to 94%
- Recall for classifying buses jumped from 89% to 94%
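The grid-search mechanics described above can be seen end-to-end in a minimal, self-contained sketch. It runs on synthetic data (via `make_classification`, my stand-in for the vehicle features) rather than this notebook's splits, and shows how `best_params_`, `best_score_`, and the per-candidate scores in `cv_results_` are read after the search:

```python
# Minimal GridSearchCV sketch on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# 2 C values x 2 kernels = 4 candidates, each cross-validated 5-fold
grid = GridSearchCV(SVC(), {"C": [1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X_demo, y_demo)

print(grid.best_params_)   # the winning combination
print(grid.best_score_)    # its mean cross-validated accuracy
# mean test score per candidate, in the order GridSearchCV tried them
print(grid.cv_results_["mean_test_score"])
```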
#With first set of parameters :Iteration 1
hypertune_classifier ("Support Vector Classifier",svmc,param_grid,X_train,y_train,X_test,y_test,10)
Iteration 1 result - Best Score: 0.9611016949152542, Best Params: {'C': 1, 'kernel': 'rbf'}
#With second set of parameters :Iteration 2
hypertune_classifier ("Support Vector Classifier-iteration2",svmc,param_grid_1,X_train,y_train,X_test,y_test,10)
Insights:
- Here we tune the model's important hyperparameters (which are not learned model parameters) to improve performance; we vary the C value and the kernel type, where:
  C: inverse of regularization strength - smaller values of C specify stronger regularization.
  Kernel type: rbf or linear
- GridSearchCV evaluates every combination of the specified hyperparameters and reports the best model and score for us.
- After the grid-search hyperparameter tuning of the SVM model we got the best score from Iteration 2:
  Best Score: 0.9662429378531072
  Best Params: {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
  Accuracy score: 96%, up from 95%, with the best params (C: 1000, kernel: rbf)
- Precision for classifying buses jumped from 93% to 97%
- Recall for classifying buses jumped from 89% to 94%
- Precision for classifying vans jumped from 89% to 95%, while precision for cars dropped from 99% to 97%
#We can see a slight improvement in model accuracy: 95%
#We analysed how principal components helped us pick only the relevant dimensions that carry the major information,
#by analysing the relationships between the independent attributes
#Also, scale or normalise your data so the model performs better
#Fine-tune the model with hyperparameter tuning techniques, which improve model performance and also employ cross-fold validation internally to make sure the model is ready for a production environment
#We also saw that we achieved 95% accuracy with only 8 attributes instead of 18, so PCA is a great tool for making the model perform better